Improving Language Models for Mandarin Conversational Speech Recognition with Web Data

نویسندگان

  • Tim Ng
  • Mari Ostendorf
  • Mei-Yuh Hwang
  • Manhung Siu
  • Ivan Bulyko
  • Xin Lei
چکیده

Lack of data is a problem in training language models for conversational speech recognition, particularly for languages other than English. Experiments in English have successfully used webbased text collection targeted for a conversational style to augment small sets of transcribed speech; here we look at extending these techniques to Mandarin. In addition, we investigate different techniques for topic adaptation. Experiments in recognizing Mandarin telephone conversations show that use of filtered web data leads to a 28% reduction in perplexity and 7% reduction in character error rate, with most of the gain due to the general filtered web data.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Porting the Galaxy System to Mandarin Chinese

Galaxy is a human-computer conversational system that provides a spoken language interface for accessing on-line information. It was initially implemented for English in travel-related domains, including air travel, local city navigation, and weather. Efforts were started to develop multilingual systems within the framework of galaxy several years ago. This thesis focuses on developing the Mand...

متن کامل

Studies on Training Text Selection for Conversational Finnish Language Modeling

Current ASR and MT systems do not operate on conversational Finnish, because training data for colloquial Finnish has not been available. Although speech recognition performance on literary Finnish is already quite good, those systems have very poor baseline performance in conversational speech. Text data for relevant vocabulary and language models can be collected from the Internet, but web da...

متن کامل

Integration of phonetic and prosodic information for robust utterance verification - Vision, Image and Signal Processing, IEE Proceedings-

Mandarin speech is known for its tonal charactcristic, and prosodic information plays an important role in Mandarin speech recognition. Driven by this propcrty, phonetic and prosodic information are integrated and used for Mandarin telephone speech keyword spotting. A two-stage strategy, with recognition followed by verification, is adopted. For keyword recognition, 132 subsyllable models, two ...

متن کامل

Improving speech recognition and keyword search for low resource languages using web data

We describe the use of text data scraped from the web to augment language models for Automatic Speech Recognition and Keyword Search for Low Resource Languages. We scrape text from multiple genres including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant for language modeling of conver...

متن کامل

Development of the Cuhtk 2004 Rt04f Mandarin Conversational Telephone Speech Transcription System

This paper describes the development of the CUHTK 2004 Mandarin conversational telephone speech transcription system. The paper details all aspects of the system, but concentrates on the development of the acoustic models. As there are significant differences between the available training corpora, both in terms of topics of conversation and accents, forms of data normalisation and adaptive tra...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004